Introduction

This project is an exploratory data analysis of the white wine data set downloaded from https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityWhites.csv

The aim of this project is to analyze which physicochemical properties, such as alcohol content, acidity and sugar content, affect wine quality.

First, we explore the data in a broad sense and obtain some basic information.

## [1] 4898   13
## [1] 4898
## [1] 13

We can see that our data set comprises 4898 rows and 13 columns. The column names are listed below -

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
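The shape and column names above could be obtained along the following lines. This is a minimal sketch rather than the report's original code: the data frame name `wine` is an assumption, and a three-row inline sample (four columns only) stands in for the downloaded file so that the snippet runs offline.

```r
# Inline stand-in for wineQualityWhites.csv. The measurements are the first
# three records shown in the str() output of this report; the X values are
# assumed row numbers.
csv_text <- "X,fixed.acidity,alcohol,quality
1,7,8.8,6
2,6.3,9.5,6
3,8.1,10.1,6"

# With the real file, pass its path or URL instead of `text =`.
wine <- read.csv(text = csv_text)

dim(wine)    # rows and columns together
nrow(wine)   # number of rows
ncol(wine)   # number of columns
names(wine)  # column names
```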

Before continuing, we drop the X column, an id column that holds the row (record) number.
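One way to drop such a column in base R is sketched below. The data frame name `wine` and the toy values are assumptions made for illustration.

```r
# Toy data frame standing in for the wine data (illustrative values).
wine <- data.frame(
  X = 1:3,
  alcohol = c(8.8, 9.5, 10.1),
  quality = c(6L, 6L, 6L)
)

# Assigning NULL to a column removes it from the data frame.
wine$X <- NULL

# Compact, non-statistical overview of the remaining columns.
str(wine)
```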

Next, we take a look at a high-level, non-statistical summary of the entire data frame.

## 'data.frame':    4898 obs. of  12 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

From the above results, we can see that all 12 remaining columns are of numeric or integer type. Next, we look at the statistical summary of the data set.

##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.878  
##  3rd Qu.:6.000  
##  Max.   :9.000

From the above statistical summary, we can see that citric acid is the only variable with a minimum of 0; every other variable has a strictly positive minimum. Also, no NA’s are listed for any variable, so we can conclude that the data set has no missing values.
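Reading the summary for NA rows works, but the absence of missing values can also be checked directly. The sketch below uses a toy data frame with one deliberately missing value (an assumption for illustration; the wine data itself has none) to show what the check reports.

```r
# Toy data with one deliberate NA so the check has something to find.
toy <- data.frame(
  citric.acid = c(0, 0.36, NA),
  alcohol = c(8.8, 9.5, 10.1)
)

# Number of missing values per column; all zeros means a complete data set.
na_counts <- colSums(is.na(toy))
na_counts

# Single TRUE/FALSE answer for the whole data frame.
has_missing <- any(is.na(toy))
has_missing
```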

Before doing further analysis, we look at the SPLOM, histograms and correlations of our data set using the pairs.panels function. We do this first because, as the analysis proceeds, we want to decide whether outliers need to be removed from certain columns based on the correlations between the variables.

From the above visualization, we can see that the variables fixed.acidity, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH and alcohol have weak to strong correlations with other variables, so we will remove outliers from these variables if necessary. The variables volatile.acidity, citric.acid and sulphates show no relationship with other variables, so we will not consider removing outliers from them. Later, in the bivariate analysis, we will look at the correlation matrix again after the transformations and outlier removals.

We will do further analysis of all the variables. First, we begin with analyzing the quality.

## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

From the above output, we can see how the wines are distributed across the quality scores. We now analyze quality using a histogram.

From the histogram and table output, we can see that the most common quality score is 6. Surprisingly, there are no wines with a score of 10 or with scores from 0 to 2. There are only 20 wines with a score of 3 and only 5 wines with a score of 9. We do not worry about outliers in quality, as this is going to be our dependent variable.

For further analysis of the data, we create a new factor variable named factor_wine_quality based on quality. In the new column, wines with a score of 5 or less fall under “Low Quality”, wines with a score of 6 under “Medium Quality” and wines with a score greater than 6 under “High Quality”.

We analyze the new variable factor_wine_quality to see how many wines fall under “Low Quality”, “Medium Quality” and “High Quality”.

## 
##    Low Quality Medium Quality   High Quality 
##           1640           2198           1060
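The binning described above can be reproduced with `cut()`. The sketch below is one possible implementation, not necessarily the one used for this report; it rebuilds the quality column from the frequency table shown earlier, so the resulting counts can be compared with the output above.

```r
# Rebuild the quality scores from the reported frequency table.
quality <- rep(3:9, times = c(20, 163, 1457, 2198, 880, 175, 5))

# Right-closed bins: (-Inf, 5], (5, 6] and (6, Inf].
factor_wine_quality <- cut(
  quality,
  breaks = c(-Inf, 5, 6, Inf),
  labels = c("Low Quality", "Medium Quality", "High Quality")
)

counts <- table(factor_wine_quality)
counts
```

Because `cut()` closes intervals on the right by default, a score of exactly 5 lands in “Low Quality” and a score of exactly 6 in “Medium Quality”.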

From the above results, we can see that the largest share of the wines in our data set falls under “Medium Quality”. It is interesting to note that even though medium quality wines comprise only the wines with a quality score of 6, they still constitute the largest portion of the distribution.

Next, we take a look at the fixed acidity column. It records the acids present in the wine that are fixed or nonvolatile (they do not evaporate readily). We want to see how fixed acidity affects quality.

From the histogram, we can say that fixed acidity is fairly normally distributed. It varies from 3.8 to 14.2 g/dm^3, and most of the records are between 4.4 g/dm^3 and 10 g/dm^3. We will still check whether any outliers are present in the data.

We can see from the above box plot that there are outliers present in the fixed.acidity column.

We will analyze how many outliers we have using Interquartile ranges i.e. if a data point is below Q1 - 1.5×IQR or above Q3 + 1.5×IQR, it will be considered as an outlier.

The number of upper outliers is -

## [1] 123

The number of lower outliers is -

## [1] 23
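The IQR rule described above can be written as a small helper. This is a sketch under stated assumptions: the function name `count_iqr_outliers` is hypothetical, the sample values are illustrative rather than taken from the wine data, and `quantile()` is used with its default interpolation method.

```r
# Count values outside the fences Q1 - 1.5 * IQR and Q3 + 1.5 * IQR.
count_iqr_outliers <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  c(
    lower = sum(x < q[1] - 1.5 * iqr),
    upper = sum(x > q[2] + 1.5 * iqr)
  )
}

# Illustrative values: one point far below and one far above the bulk.
x <- c(1, 10, 11, 12, 13, 14, 15, 16, 17, 30)
outlier_counts <- count_iqr_outliers(x)
outlier_counts
```

On the wine data the same helper would be called on a column, for example `count_iqr_outliers(wine$fixed.acidity)`, where `wine` is an assumed name for the data frame.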

Even though the IQR rule flags 146 outliers (123 upper and 23 lower), removing all of them does not make sense, as we can see from the histogram and correlations that only a few data points lie away from the normal distribution. So, we will work with the data between the 2nd and 98th percentiles of the attribute values.

The above box plot is the box plot for fixed.acidity after removing the outliers. Depending on further analysis, we will determine if there is need to remove remaining outliers.
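Keeping only the values between the 2nd and 98th percentiles can be sketched as follows. The helper name `trim_to_percentiles`, the inclusive bounds and the toy data are assumptions; the report does not show how the trimming was actually coded.

```r
# Keep the rows whose value in `column` lies inside the percentile band.
trim_to_percentiles <- function(df, column, lower = 0.02, upper = 0.98) {
  bounds <- quantile(df[[column]], probs = c(lower, upper), names = FALSE)
  keep <- df[[column]] >= bounds[1] & df[[column]] <= bounds[2]
  df[keep, , drop = FALSE]
}

# Toy column: a tight cluster plus the two extremes from the summary table.
toy <- data.frame(fixed.acidity = c(3.8, seq(6, 8, length.out = 98), 14.2))
trimmed <- trim_to_percentiles(toy, "fixed.acidity")

range(toy$fixed.acidity)      # includes 3.8 and 14.2
range(trimmed$fixed.acidity)  # extremes are gone
nrow(trimmed)
```

Note that trimming by percentile removes a fixed share of rows from each tail whether or not those rows are outliers by the IQR rule, which is why some outliers can remain in the new box plot.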

We then move on to analyzing volatile acidity. Volatile acidity refers to the amount of acetic acid in the wine, which at too high a level can lead to an unpleasant, vinegar taste.

The data varies from 0.08 g/dm^3 to 1.1 g/dm^3. The histogram is a little right skewed, so a transformation such as the square root may be appropriate here.

This histogram looks far better now. On the square-root scale, the data is distributed between approximately 0.29 and 1.05. We do not worry about outliers in the transformed data, as we know that volatile acidity has no correlation with the other variables. We will consider removing outliers later, in the bivariate analysis, if required.
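As a small illustration of the square-root transformation, the sketch below applies it to a handful of volatile acidity values taken from the summary and str() output above (the minimum, two sample records and the maximum). The variable name new.volatile.acidity matches the column name used later in this report; everything else is illustrative.

```r
# Minimum, two sample records and maximum of volatile.acidity.
volatile.acidity <- c(0.08, 0.27, 0.30, 1.10)

# The square root compresses the long right tail more than the left side.
new.volatile.acidity <- sqrt(volatile.acidity)

# Range on the transformed scale.
range(new.volatile.acidity)
```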

We now begin analyzing citric acid.

From the histogram, we can see that the data varies between 0 and 1.66. The distribution looks somewhat normal, so we do not need to apply any transformation here. Since there is no correlation between citric acid and the other variables, we do not consider removing outliers.

After citric acid, we take a look at the residual sugar. Residual sugar is the amount of sugar remaining after fermentation stops.

It is rare to find wines with less than 1 gram/liter of residual sugar, and wines with more than 45 grams/liter are considered sweet. From the histogram, we can see that 75 wines have less than 1 gram/liter (the bar at 0.75 gram/liter) and 1 wine has more than 45 grams/liter (the bar at 65.75 grams/liter). The histogram is right skewed, so we try a log transformation on residual sugar to see if it changes our understanding of the data.

We can see that we now have a bimodal distribution with two peaks, at 0.125 and 0.875. The distribution begins at a small negative value of -0.225 and extends to 1.825. Next, we check whether there are any outliers in the transformed data.
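The log transformation can be sketched in the same way. Base 10 is an assumption here: it is inferred from the fact that the reported range of the transformed data (about -0.225 to 1.825) matches the base-10 logarithms of the minimum and maximum residual sugar in the summary table. The sample values are the minimum, three records from the str() output and the maximum.

```r
# Minimum, three sample records and maximum of residual.sugar.
residual.sugar <- c(0.6, 1.6, 6.9, 20.7, 65.8)

# Base-10 logarithm (assumed base); values below 1 become negative.
new.residual.sugar <- log10(residual.sugar)

# Range on the transformed scale.
range(new.residual.sugar)
```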

As seen from the box plot, our transformed data for residual sugar has no outliers. Next, we look at chlorides, which refers to the amount of salt in the wine.

The data is roughly normally distributed, with some outliers present. As seen from the histogram, the data varies from 0.013 to 0.345. It would be interesting to see how the amount of chlorides (salt) affects wine quality. We will look at that in the bivariate and multivariate analysis.

We will now see if chlorides has any outliers present.

From the box plot, it can be seen that there are a considerable number of outliers in the data. As the correlation plot and the near-normal distribution suggest, we do not need to remove all the outliers; instead, we follow the same technique as before and keep the data between the 2nd and 98th percentiles.

After comparing both box plots, we can see that the maximum value among the outliers has decreased from 0.346 to 0.16. If the need arises to remove all the outliers, we will consider that in the bivariate and multivariate analysis.

After chlorides, we analyze the free sulfur dioxide present in the wine. The free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine.

From the histogram, we can see that the data varies from 3 to 289 units. Overall, it has a normal distribution. Since free sulfur dioxide has strong correlations with other variables, we take a look at the outliers.

From the box plot, it is evident that there are no outliers below the 1st quartile and only one point greater than 150, so only the data points below the 99th percentile are considered for further analysis.

From the new box plot, we can see that most of the outliers are removed. If the need arises, we will consider removing the remaining outliers in the bivariate or multivariate analysis.

We then proceed with the total sulfur dioxide variable, which refers to the amount of free and bound forms of SO2. In low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of the wine.

The data is close to normally distributed, varying from 17.5 ppm to 367.5 ppm. It would be interesting to see how much total SO2 is present in high quality wines. We will explore that in the bivariate and multivariate analysis.

To look at the outliers, we use a box plot.

We can see that there are very few outliers present. We use the interquartile range method to count them. The numbers of upper and lower outliers using IQR are -

## [1] 9
## [1] 2

Since the number of outliers is very small, we remove all of them here and plot a new box plot.

After the outliers are removed, we move on to another variable, density. The density of wine is close to that of water, depending on the percent alcohol and sugar content.

The data varies from 0.98725 to 1.03875. We can say that the density of all the wines in our data set is close to the density of water, which is 1 g/cm^3. We can see from the histogram that the data is normally distributed, with just a few points falling outside the normal distribution. We examine the outliers using a box plot.

We see that we have just 3 outliers, so we remove them using the interquartile range method.

After looking at density, we take a look at the pH column. It describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3 and 4 on the pH scale.

We can see that the values for pH vary from 2.79 (highly acidic) to 3.81 (medium acidic). The distribution is quite normal. We will see if there are any outliers present for pH.

We can see that there are outliers present for pH. We can find out the number of outliers using IQR method. The number of upper and lower outliers is as follows -

## [1] 57
## [1] 5

We can see from the SPLOM and the distribution that there are only a couple of points that need to be removed. Hence, we again work with the data between the 2nd and 98th percentiles.

We can see from the new box plot that most of the outliers are removed. We will consider removing the remaining outliers, if needed, in the bivariate and multivariate analysis.

Next, we analyze sulfates. Sulfates are a wine additive that can contribute to sulfur dioxide gas (SO2) levels; SO2 acts as an antimicrobial and antioxidant.

From the histogram, we can see that the data is again fairly normally distributed and ranges from 0.23 to 1.08. Since sulfates do not have any correlation with the other variables, we do not look at the outliers.

Finally, we look at a very important component of wine: alcohol. The data gives the percent alcohol content of each wine, which is distributed as follows -

From the histogram, we can see that alcohol has a somewhat multimodal distribution that varies from 8.4 to 14.2. Most of the wines have an alcohol content between 9 and 10 percent. It will be interesting to see whether high quality wines have a higher alcohol content.

To see if there are any outliers, we will examine its box plot.

It is good to see that there are no outliers present for alcohol data.

Univariate Analysis

What is the structure of your dataset?

The data set was in .csv format before being imported into R and describes white wines, with 4898 rows and 13 columns (12 after dropping the X row-number column). Each row holds the attributes of one wine that can affect its quality. Together, the rows show which levels of each attribute go with quality scores ranging from 3 to 9.

What is/are the main feature(s) of interest in your dataset?

My main interest is in finding out how wine quality is affected by the other variables in the data set. In particular, based on our correlation matrix, it would be interesting to see how much the percent alcohol content, the density of the wine, pH, the amount of nonvolatile acids and residual sugar affect the quality of the wine.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

The other features in the data set that will help support the features of interest are the amount of sulfur dioxide and chlorides (salt), as well as different combinations of the features listed above, in analyzing the quality of wine.

Did you create any new variables from existing variables in the dataset?

Yes, 3 new variables labeled factor_wine_quality, new.volatile.acidity and new.residual.sugar are created from existing variables in the data set. factor_wine_quality is the factor variable based on the existing variable quality. new.volatile.acidity is the new transformed data for existing variable volatile.acidity, new.residual.sugar is the new transformed data for existing variable residual.sugar.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

Yes, volatile acidity and residual sugar had right-skewed distributions. To change the form of the data, a square root transformation and a logarithmic transformation, respectively, were performed, and new columns were created to hold the transformed data. The reason for the transformations was to bring these variables closer to normality prior to the use of a linear regression model.

Bivariate Plots Section

We begin the bivariate analysis by looking at the correlation matrix of the existing columns as well as new transformed columns added in the data set.

From the correlation matrix, it can be seen that, among all the independent variables, alcohol has the highest correlation with the dependent variable quality (0.44). Alcohol also has the largest negative correlation, with density (-0.81), and density has the second highest correlation with quality in magnitude (-0.31), which is not surprising since it is negatively correlated with alcohol. Density is also strongly correlated with residual sugar (0.78), and we can see some medium correlations between total.sulfur.dioxide and alcohol (-0.46), total.sulfur.dioxide and residual.sugar (0.43), total.sulfur.dioxide and density (0.55), and chlorides and alcohol (-0.42). There are also some weak correlations, such as fixed.acidity and pH (-0.38) and free.sulfur.dioxide and residual.sugar (0.36). Finally, there is a very obvious relationship between free.sulfur.dioxide and total.sulfur.dioxide, with a correlation of 0.61.
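A correlation matrix like the one discussed above can be computed with `cor()`. The sketch below uses synthetic data, generated only so that the snippet runs on its own; the coefficients it prints are not the ones reported for the wine data, and the data frame name `toy` is an assumption.

```r
set.seed(1)

# Synthetic stand-in: density falls and quality rises with alcohol.
alcohol <- runif(200, min = 8, max = 14)
density <- 1.02 - 0.0025 * alcohol + rnorm(200, sd = 0.001)
quality <- round(3 + 0.3 * alcohol + rnorm(200))
toy <- data.frame(alcohol, density, quality)

# Pairwise Pearson correlations between all numeric columns.
cor_matrix <- cor(toy)
round(cor_matrix, 2)

# Correlations of every variable with quality, strongest positive first.
sort(cor_matrix[, "quality"], decreasing = TRUE)
```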

Let us begin investigating first the relationship of all the variables with quality.

First, we investigate the relationship between fixed acidity and quality. We know from the correlation matrix that there is almost no correlation between the two variables (0.08). We try to visualize the distribution of fixed acidity for the different wine qualities using a scatter plot.

From the scatter plots, we do not see much difference in fixed acidity based on the quality of the wine.

From the above frequency polygon, we can see that the maximum count of fixed acidity for all the wine qualities occurs at approximately 6.758 g/dm^3. Overall, the lowest value of fixed acidity is approximately 5 g/dm^3 for all the wines and the highest value is approximately 9.20 g/dm^3.

Compared to the scatter plot and the frequency polygon, the box plot gives a better visualization of the fixed acidity distribution by wine quality. The median of all the wine qualities is the same, i.e. 6.8 g/dm^3. The range of values for Low quality and medium quality wines lie between 5.1 and 8.5 whereas for high quality wine the range of values is 5.1 - 8.5. The highest value of fixed acidity is 9.1, found in both high quality and low quality wines. High quality wines have more outliers than medium and low quality wines.
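The per-class medians and ranges read off the box plot can also be computed directly. The sketch below uses `tapply()` on a small made-up data frame; the values are illustrative only and the data frame name `toy` is an assumption.

```r
# Made-up fixed acidity values, three wines per quality class.
quality_levels <- c("Low Quality", "Medium Quality", "High Quality")
toy <- data.frame(
  fixed.acidity = c(6.2, 6.8, 7.4, 6.5, 6.8, 7.1, 6.0, 6.8, 7.9),
  factor_wine_quality = factor(rep(quality_levels, each = 3),
                               levels = quality_levels)
)

# Median fixed acidity within each quality class.
medians <- tapply(toy$fixed.acidity, toy$factor_wine_quality, median)
medians

# Minimum and maximum within each quality class.
ranges <- tapply(toy$fixed.acidity, toy$factor_wine_quality, range)
ranges
```

The matching base R plot would be `boxplot(fixed.acidity ~ factor_wine_quality, data = toy)`.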

Next, we move on to analyzing volatile.acidity and the quality of wines.

The scatter plot does not reveal much information about the relationship between volatile acidity and quality. We can only see that the highest volatile acidity value belongs to a wine with a quality score of 4, that is, a low quality wine.

Next, we look at the frequency polygon, where we can see that the maximum count of volatile acidity is approximately the same for low and medium quality wines, whereas the maximum count for high quality wines is lower than for the other two wine qualities. Since we are analyzing transformed data here, we do not have actual values. Overall, we can say that volatile acidity is lower in wines with high quality (a score greater than or equal to 7), which is to be expected because too high a level can lead to an unpleasant, vinegar taste.

We can see from the above box plots that low quality wines have a higher median value than medium and high quality wines. Surprisingly, there are also some values (outliers) in high quality wines that are larger than the values in medium quality wines, and the overall range of high quality wines is larger than that of medium quality wines.

Next, we move on to analyze the citric acid.

Most of the wines have less or almost zero amount of citric acid. There is an outlier in the medium quality wine which has the highest value of 1.660 units.

From the above frequency polygon, we can see that the maximum count of citric acid values in all the wines is approximately equal to 0.29 units. Overall there is no indication of the relationship that can be figured out here.

We can see that the range of values for citric acid is low for High quality wines and the highest for Low Quality wines. The highest amount of citric acid can be seen in Low quality and Medium Quality Wines. The median value is approximately same for all the wines i.e. 0.31 units. Overall it seems that there is a balance of citric acid in high quality and medium quality wines.

One of the factors that I think is important for wine quality is residual sugar.

We can safely say from the scatter plot that medium quality and low quality wines have more residual sugar than high quality wines. Since the values are transformed, there are negative values in the chart.

Again, all the wines in the data set have approximately the same maximum of residual sugar. We can see that it is a bimodal distribution.

From the box plot, we can see that the median values for high quality and medium quality wines are smaller than for low quality wines. The range of values for high quality and medium quality wines is also smaller than for low quality wines. It can be safely assumed that, in our data set, high quality wines have a low amount of residual sugar.

Next, we analyze how the amount of chlorides (salt) helps determine the quality of wine.

We do see that the high quality wines have less chlorides than the medium and low quality wines.

The frequency polygon is interesting here. The highest count of chlorides is found at the value approximately equal to 0.038 for high and medium quality wines. For low quality wines the most common value of chlorides is 0.048.

The box plots here provide a pretty good idea about the chlorides proportion in the wines. We can see that median is smaller for medium and high quality wines. Overall the range of chlorides in high quality wines is smaller than the range in medium and low quality wines.

Next we look at how free sulfur dioxide varies for wine qualities.

The relationship between free sulfur dioxide and quality is vague in the scatter plot. The highest value of free sulfur dioxide is found in a medium quality wine and the lowest value in a low quality wine.

The most common value of free sulfur dioxide in medium quality wines is approximately 32 units, and for low and high quality wines it is around 34.9 units.

The median values are approximately the same for all the wines. The range of free sulfur dioxide is smaller for high quality wines than for medium and low quality wines.

With not much to conclude here, we move on to total sulfur dioxide.

It can be seen that amount of total sulfur dioxide is smaller in high quality wines than the low and medium quality wines.

The common values for different qualities of wines differ here. Most common value for medium quality wines is at 110.55 ppm, for high quality wines the most common value is approximately equal to 126.35 ppm and for low quality wines the most common value is 134.25 ppm.

We can say that the median and range are smaller for high quality wines compared to Low quality and medium quality wines.

We will next see how much variance in density affects the quality of the wines.

From the above scatter plot we can see that high quality wines have lower density compared to medium and low quality wines. The highest density is found in low quality wines.

We can see that the most common value of density for all the wine qualities lies approximately between 0.991 and 0.995, which is close to that of water (1 g/cm^3).

With box plot, we can see clear distinction between densities for each wine quality. High quality wines have smaller range and median value compared to medium and low quality wines. The highest value of density for high quality wine is 1.0006 g/cm^3 whereas it is 1.0024 g/cm^3 for low quality wine and 1.0017 for medium quality wines. The lowest value of density can also be found in high quality wines which is 0.9871 g/cm^3.

Having found stronger variance in density with respect to the quality we move on to the pH.

We do not see much relation between pH and wine qualities from scatter plot here.

The most common values of pH for medium and low quality wines are approximately equal to 3.19 and most common value of pH for high quality wines is 3.26 which means high quality wines can be slightly more basic than the medium and low quality wines.

The median for High quality wines is higher than low and medium quality wines which is in sync with our conclusion previously that high quality wines are slightly more basic than the low and medium wines. The range is pretty much same for medium and high quality wines, it is however smaller for low quality wines.

From the scatter plot, we cannot see much of a relationship between sulfates and wine quality. Overall, we can say that the highest value of sulfates is found in a high quality wine and the lowest value in a medium quality wine.

The most common value of sulfates in low quality wine is 0.47 units approximately. The common value of sulfates in medium quality wines is 0.50 approximately and the common value of sulfates in the high quality wines is 0.38 units.

We can see that the range of sulfates is higher in high quality wines and lowest in low quality wines. The median is approximately same.

Next, we look at the most important chemical in wine, which is alcohol.

The most common alcohol content in low quality wines is 9.4 percent. The most common alcohol content in medium quality wines is 10.4 percent, and for high quality wines it is 11 percent.

We can see from the scatter plot that high quality wines have a slightly higher alcohol content than low and medium quality wines. Interestingly, one of the high quality wines also has one of the lowest alcohol values. We analyze this further using a box plot.

It can be clearly seen that the median alcohol content of high quality wines is greater than that of medium and low quality wines. The range is also larger for high quality wines than for low and medium quality wines. The lowest alcohol content is 8.4 percent, in a low quality wine, and the lowest value among medium and high quality wines is 8.5 percent. The highest alcohol content in a high quality wine is 14.2 percent.

We would also like to analyze the relationships behind some of the strong correlations we observed between the other variables. We start with density and alcohol, which have a strong negative correlation of -0.81.

## 
##  Pearson's product-moment correlation
## 
## data:  density and alcohol
## t = -92.033, df = 4512, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.8176455 -0.7973553
## sample estimates:
##        cor 
## -0.8077394

We can see that the p-value is less than 0.05, so at the 5% significance level we can say that density and alcohol are significantly correlated, with a correlation coefficient of -0.81. We can see from the plots above that, for high quality wines, density was low but alcohol content was higher.
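Output of this kind comes from `cor.test()`, which tests the null hypothesis that the true correlation is zero. The sketch below runs it on synthetic data, generated only for illustration, and shows how to read the individual pieces of the result; the numbers it produces are not those of the wine data.

```r
set.seed(42)

# Synthetic stand-in: density decreases as alcohol increases.
alcohol <- runif(100, min = 8, max = 14)
density <- 1.02 - 0.0025 * alcohol + rnorm(100, sd = 0.002)

# Pearson's product-moment correlation test (the default method).
result <- cor.test(density, alcohol)

result$estimate   # sample correlation coefficient
result$p.value    # compare with the chosen significance level, e.g. 0.05
result$conf.int   # 95 percent confidence interval for the correlation
result$parameter  # degrees of freedom, n - 2
```

The degrees of freedom equal the number of complete pairs minus 2, so df = 4512 in the output above corresponds to the 4514 rows left after outlier removal.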

Next we take a look at the density and residual sugar.

From the scatter plot we can see positive correlation.

## 
##  Pearson's product-moment correlation
## 
## data:  density and new.residual.sugar
## t = 83.22, df = 4512, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7663700 0.7893991
## sample estimates:
##      cor 
## 0.778146

We can see that the p-value is less than 0.05, so at the 5% significance level we can say that density and residual sugar are significantly correlated, with a correlation coefficient of 0.78. This makes sense, since we found from the plots above that both density and residual sugar content were low for high quality wines.

Next we look at the very obvious correlation between free sulfur dioxide and total sulfur dioxide.

As expected we see a positive correlation between these two variables.

## 
##  Pearson's product-moment correlation
## 
## data:  free.sulfur.dioxide and total.sulfur.dioxide
## t = 52.035, df = 4512, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5938361 0.6303126
## sample estimates:
##       cor 
## 0.6124002

We can see that the p-value is less than 0.05, so at the 5% significance level we can say that total and free sulfur dioxide are significantly correlated, with a correlation coefficient of 0.61.

At the end of bivariate analysis we will take a look at the relation between total sulfur dioxide and density.

As we have seen from the correlation matrix, we see a positive correlation.

## 
##  Pearson's product-moment correlation
## 
## data:  density and total.sulfur.dioxide
## t = 44.527, df = 4512, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.5319151 0.5724609
## sample estimates:
##       cor 
## 0.5525148

We can see that the p-value is less than 0.05, so at the 5% significance level we can say that density and total sulfur dioxide are significantly correlated, with a correlation coefficient of 0.55.

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the data set?

We observed many relationships as part of the bivariate analysis. Most of them involved the relationship of independent variables such as alcohol, density and chlorides with the dependent variable quality, and also with the factor variable that has three classes of quality.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Yes, we also analyzed the relationship between the independent variables such as total sulfur dioxide and density, density and residual sugar, density and alcohol. Basically we analyzed the relationships between the variables which were strongly correlated.

What was the strongest relationship you found?

The strongest relationship was found between alcohol and density. We can say that as the percent of alcohol content increases the density decreases.

Multivariate Plots Section

In the bivariate analysis, we found that density and alcohol were more strongly related to the quality of wine than the other variables, so we begin by investigating the relationship between alcohol, density and quality.

From the above plot we can see that we are affirming our bivariate analysis conclusion that wines with high quality have higher percentage content of alcohol and lower density values.

We already knew about this relationship from the bivariate analysis. Next, we look at how other variables that are reasonably well correlated with either alcohol or density relate to quality.
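A plot of two variables coloured by quality class, like those discussed in this section, could be produced along the following lines. This is only a sketch: it assumes the ggplot2 package is available, uses synthetic data, and the column name `quality.class` and the cut points for the three classes are hypothetical choices rather than the ones used in the report.

```r
library(ggplot2)

# Synthetic stand-in data (an assumption for illustration, not the wine data set).
set.seed(1)
n <- 300
wine_demo <- data.frame(
  alcohol = runif(n, min = 8, max = 14),
  quality = sample(3:9, n, replace = TRUE)
)
# Density falls as alcohol rises, plus random noise.
wine_demo$density <- 1.005 - 0.001 * wine_demo$alcohol + rnorm(n, sd = 0.001)

# Group the numeric quality score into three classes.
# Hypothetical cut points: 3-5 = Low, 6 = Medium, 7-9 = High.
wine_demo$quality.class <- cut(
  wine_demo$quality,
  breaks = c(0, 5, 6, 10),
  labels = c("Low", "Medium", "High")
)

# Scatter plot of two variables, with colour encoding the quality class.
p <- ggplot(wine_demo, aes(x = alcohol, y = density, colour = quality.class)) +
  geom_point(alpha = 0.5) +
  labs(x = "Alcohol", y = "Density", colour = "Quality class")

print(p)
```

Swapping the `x` and `y` aesthetics for other column pairs gives the remaining plots in this section.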

We will begin with analyzing alcohol and pH.

We do not see any clear relationship between alcohol and pH with respect to quality. High quality wines with a high alcohol content can have the same pH as high or low quality wines with a lower alcohol content.

Next, we check alcohol and chlorides.

We can see that the chloride content is noticeably higher among wines with a lower alcohol content. Most of the high and medium quality wines with an alcohol content greater than 10 percent have a chloride content below 0.08 units. Low and medium quality wines with an alcohol content of around 9 percent have higher amounts of chlorides. There is one exception: a medium quality wine with 13.4 percent alcohol that also has a high chloride content.

Then we analyze density and total sulfur dioxide.

We can see that most of the high quality wines have low density as well as low total sulfur dioxide.

Next, we move on to density and residual sugar.

This graph is quite interesting. We know that density and residual sugar have one of the highest correlations in the data, but there is also a pattern with respect to quality. Most of the high quality wines have low density as well as low residual sugar, while low quality wines have higher density as well as higher residual sugar.

We will also take a look at alcohol and residual sugar.

It is interesting to see that medium and low quality wines have more or less the same amount of residual sugar. There are some high quality wines with a lower alcohol content that have high residual sugar.

Now let's take a look at residual sugar and pH.

It is interesting to see that there is no trend between pH and residual sugar. Generally, wines with a lower pH, that is, more acidic wines, have more residual sugar to offset the sourness of the acid.

We will also take a look at volatile acidity and alcohol.

We do not see much of a relationship here. The values of volatile acidity appear to be similar for all wines, irrespective of their alcohol content and their quality.

We will also explore density and free sulfur dioxide.

We can say that the amount of free sulfur dioxide in the wines does not depend on density, nor does it vary in any clear way with the quality of the wine.

Next, alcohol and citric acid are analyzed.

We do not see much of a trend between citric acid and alcohol.

Next, we move on to sulfates and density.

The sulfates value might be slightly lower for low quality wines and for wines with higher density. Some of the higher sulfates values are observed for high quality wines with lower density.

We will now fit a linear regression model to understand which variables are statistically significant in determining the quality of wine. In this model, quality is the dependent variable and all the other columns are independent variables. We start with all the columns and then remove the statistically insignificant variables (those with p-value > 0.05) one by one.

## 
## Call:
## lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid + 
##     residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     density + pH + sulphates + alcohol, data = wine_dataset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3607 -0.5060 -0.0520  0.4591  3.0438 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           2.064e+02  2.542e+01   8.118 6.04e-16 ***
## fixed.acidity         1.451e-01  2.565e-02   5.659 1.62e-08 ***
## volatile.acidity     -1.823e+00  1.191e-01 -15.312  < 2e-16 ***
## citric.acid          -5.373e-02  1.001e-01  -0.537 0.591320    
## residual.sugar        9.596e-02  9.497e-03  10.104  < 2e-16 ***
## chlorides            -1.088e+00  8.248e-01  -1.320 0.187058    
## free.sulfur.dioxide   5.768e-03  9.752e-04   5.915 3.56e-09 ***
## total.sulfur.dioxide -1.098e-04  4.076e-04  -0.269 0.787704    
## density              -2.075e+02  2.576e+01  -8.055 1.01e-15 ***
## pH                    9.331e-01  1.230e-01   7.590 3.88e-14 ***
## sulphates             6.944e-01  1.059e-01   6.555 6.21e-11 ***
## alcohol               1.149e-01  3.167e-02   3.628 0.000289 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7419 on 4502 degrees of freedom
## Multiple R-squared:  0.2796, Adjusted R-squared:  0.2779 
## F-statistic: 158.9 on 11 and 4502 DF,  p-value: < 2.2e-16

This model has R^2 = 0.2796. Three variables are not statistically significant: citric acid, chlorides and total sulfur dioxide. We will remove total sulfur dioxide first, as it has the highest p-value (0.79), and fit the linear regression model again.

## 
## Call:
## lm(formula = quality ~ fixed.acidity + volatile.acidity + citric.acid + 
##     residual.sugar + chlorides + free.sulfur.dioxide + density + 
##     pH + sulphates + alcohol, data = wine_dataset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3579 -0.5077 -0.0523  0.4584  3.0455 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          2.083e+02  2.437e+01   8.549  < 2e-16 ***
## fixed.acidity        1.460e-01  2.543e-02   5.742 9.96e-09 ***
## volatile.acidity    -1.829e+00  1.167e-01 -15.681  < 2e-16 ***
## citric.acid         -5.435e-02  1.000e-01  -0.543 0.586931    
## residual.sugar       9.657e-02  9.223e-03  10.470  < 2e-16 ***
## chlorides           -1.087e+00  8.247e-01  -1.319 0.187369    
## free.sulfur.dioxide  5.616e-03  7.949e-04   7.065 1.85e-12 ***
## density             -2.094e+02  2.469e+01  -8.483  < 2e-16 ***
## pH                   9.368e-01  1.222e-01   7.666 2.15e-14 ***
## sulphates            6.937e-01  1.059e-01   6.551 6.37e-11 ***
## alcohol              1.135e-01  3.125e-02   3.633 0.000283 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7418 on 4503 degrees of freedom
## Multiple R-squared:  0.2796, Adjusted R-squared:  0.278 
## F-statistic: 174.8 on 10 and 4503 DF,  p-value: < 2.2e-16

R^2 is still the same (0.2796), and two variables, citric acid and chlorides, have p-values > 0.05. We will eliminate citric acid as it has the highest p-value (0.59).

## 
## Call:
## lm(formula = quality ~ fixed.acidity + volatile.acidity + residual.sugar + 
##     chlorides + free.sulfur.dioxide + total.sulfur.dioxide + 
##     density + pH + sulphates + alcohol, data = wine_dataset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3620 -0.5094 -0.0520  0.4579  3.0412 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           2.073e+02  2.536e+01   8.174 3.84e-16 ***
## fixed.acidity         1.439e-01  2.554e-02   5.634 1.87e-08 ***
## volatile.acidity     -1.813e+00  1.176e-01 -15.417  < 2e-16 ***
## residual.sugar        9.624e-02  9.481e-03  10.151  < 2e-16 ***
## chlorides            -1.116e+00  8.231e-01  -1.356 0.175148    
## free.sulfur.dioxide   5.727e-03  9.721e-04   5.892 4.11e-09 ***
## total.sulfur.dioxide -1.148e-04  4.074e-04  -0.282 0.778184    
## density              -2.084e+02  2.570e+01  -8.111 6.43e-16 ***
## pH                    9.385e-01  1.225e-01   7.659 2.28e-14 ***
## sulphates             6.924e-01  1.059e-01   6.540 6.84e-11 ***
## alcohol               1.134e-01  3.155e-02   3.595 0.000328 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7419 on 4503 degrees of freedom
## Multiple R-squared:  0.2796, Adjusted R-squared:  0.278 
## F-statistic: 174.8 on 10 and 4503 DF,  p-value: < 2.2e-16

This model has R^2 = 0.2796. Note that the output above includes total sulfur dioxide again (p = 0.78), although it was dropped in the previous step. Chlorides also has a p-value greater than 0.05, so we will remove it and fit the linear regression model again.

## 
## Call:
## lm(formula = quality ~ fixed.acidity + volatile.acidity + residual.sugar + 
##     free.sulfur.dioxide + total.sulfur.dioxide + density + pH + 
##     sulphates + alcohol, data = wine_dataset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3694 -0.5075 -0.0510  0.4606  3.0474 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           2.135e+02  2.495e+01   8.558  < 2e-16 ***
## fixed.acidity         1.493e-01  2.522e-02   5.920 3.45e-09 ***
## volatile.acidity     -1.821e+00  1.175e-01 -15.505  < 2e-16 ***
## residual.sugar        9.887e-02  9.282e-03  10.652  < 2e-16 ***
## free.sulfur.dioxide   5.710e-03  9.721e-04   5.873 4.58e-09 ***
## total.sulfur.dioxide -1.134e-04  4.075e-04  -0.278 0.780869    
## density              -2.148e+02  2.526e+01  -8.505  < 2e-16 ***
## pH                    9.666e-01  1.208e-01   8.002 1.54e-15 ***
## sulphates             7.008e-01  1.057e-01   6.630 3.76e-11 ***
## alcohol               1.121e-01  3.154e-02   3.554 0.000383 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7419 on 4504 degrees of freedom
## Multiple R-squared:  0.2793, Adjusted R-squared:  0.2779 
## F-statistic: 193.9 on 9 and 4504 DF,  p-value: < 2.2e-16

After removing chlorides, every variable in the model is statistically significant except total sulfur dioxide (p = 0.78), which still appears in the output above. R^2 is fairly low at 0.2793, which means that only 27.93% of the variation in quality is explained by the independent variables.
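The one-variable-at-a-time elimination described in this section can be sketched as a small loop. The function name `backward_eliminate`, the synthetic data and the column names `noise1` and `noise2` are hypothetical and are not the code used to produce the outputs above. The sketch assumes all predictors are numeric, so that coefficient names match column names.

```r
# Synthetic stand-in data (an assumption for illustration, not the wine data set).
set.seed(7)
n <- 400
demo <- data.frame(
  alcohol          = rnorm(n, mean = 10.5, sd = 1.2),
  volatile.acidity = rnorm(n, mean = 0.28, sd = 0.1),
  noise1           = rnorm(n),  # unrelated to quality by construction
  noise2           = rnorm(n)   # unrelated to quality by construction
)
demo$quality <- 3 + 0.3 * demo$alcohol - 2 * demo$volatile.acidity +
  rnorm(n, sd = 0.7)

# Hypothetical helper: refit the model repeatedly, each time dropping the
# predictor with the largest p-value, until all p-values are <= alpha.
backward_eliminate <- function(data, response, alpha = 0.05) {
  predictors <- setdiff(names(data), response)
  repeat {
    fit <- lm(reformulate(predictors, response = response), data = data)
    coefs <- summary(fit)$coefficients
    p_values <- coefs[-1, "Pr(>|t|)"]        # drop the intercept row
    names(p_values) <- rownames(coefs)[-1]
    if (max(p_values) <= alpha || length(predictors) == 1) {
      return(fit)
    }
    predictors <- setdiff(predictors, names(which.max(p_values)))
  }
}

final_fit <- backward_eliminate(demo, response = "quality")
print(summary(final_fit))
```

Because each step removes only the single least significant variable and never adds one back, a dropped variable cannot reappear in a later model. Selecting variables by p-value is a simple heuristic, and the p-values of the final model should be read with that in mind.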

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

Some of the relationships examined in this part of the investigation were those of sulfates with density and quality, of volatile acidity with alcohol and quality, and of residual sugar with density and quality. The relationship between residual sugar and density reinforced the finding that low quality white wines have more residual sugar and higher density.

Were there any interesting or surprising interactions between features?

It was interesting to find that residual sugar did not have any clear trend with pH. Generally, more acidic wines contain more residual sugar to offset the sourness of the acid; however, we did not see that trend in the data.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

Yes, I created a linear regression model with the data set. The strength of the model is that it helped me identify the variables that were statistically significant for quality. However, R^2 was only 0.2793, which means that only 27.93% of the variation in quality could be explained by the independent variables.

Final Plots and Summary

Plot One

Description One

This box plot shows the distribution of alcohol for each quality class. I chose this plot because, based on the correlation matrix, alcohol has the highest correlation with quality of all the variables. The median alcohol content of high quality wines is higher than that of medium and low quality wines, and the upper and lower quartiles of the high quality wines are also higher. The box plot for low quality wines shows that many of these wines have similar alcohol content in some parts of the scale, while in other parts the alcohol content is more spread out. From the box plots we can say that high quality wines usually have a higher alcohol content.
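The medians and quartiles that a box plot like this one displays can also be computed directly. The following is a minimal sketch on synthetic data; the column name `quality.class` and the class means are assumptions made for the example, not values from the wine data.

```r
# Synthetic stand-in data (an assumption for illustration, not the wine data set).
set.seed(3)
demo <- data.frame(
  quality.class = factor(rep(c("Low", "Medium", "High"), each = 100),
                         levels = c("Low", "Medium", "High")),
  alcohol = c(rnorm(100, mean = 9.8,  sd = 0.8),
              rnorm(100, mean = 10.6, sd = 1.0),
              rnorm(100, mean = 11.4, sd = 1.0))
)

# Lower quartile, median and upper quartile of alcohol for each class.
# The result is a matrix with one column per class.
quartiles <- sapply(split(demo$alcohol, demo$quality.class),
                    quantile, probs = c(0.25, 0.5, 0.75))
print(quartiles)

# The corresponding box plot, one box per quality class.
boxplot(alcohol ~ quality.class, data = demo,
        xlab = "Quality class", ylab = "Alcohol")
```

Comparing the `50%` row across columns gives the same comparison of medians that is read off the boxes.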

Plot Two

Description Two

This plot shows that wines with a high alcohol content have lower density and higher quality. At lower alcohol content, the density is higher and the wine quality is lower, but the variation in density is also greater. I chose this graph because density and alcohol have some of the highest correlations with the quality of the wine.

Plot Three

Description Three

This graph shows that high quality wines have lower density and residual sugar, while low quality wines have higher density and residual sugar. I chose this plot because the linear regression model shows that residual sugar and density are statistically significant variables, and because residual sugar and density have a strong positive correlation with each other.

Reflection

The white wine data contains information on almost 4900 wine records across 12 variables. I started by understanding the individual variables in the data set and used different visualizations to explore patterns in the data. Eventually, I explored the quality of the wine across all the other variables and created a linear model to predict the quality of the wine.

From the different visualizations we observed that alcohol content, along with density, was one of the major factors related to wine quality. As the alcohol content of the wine becomes higher, the wine quality increases while the density decreases. We also found that the amount of residual sugar increases at higher density, and that as the amount of chlorides in the wine increases, its quality decreases. We struggled to find a clear relationship between other variables, such as pH, volatile acidity and free sulfur dioxide, and alcohol or density. For the linear model, all the columns in the data were considered; however, the model was able to account for only about 28% of the variation in quality.

Some limitations of the model come from the distribution of the data. Most of the wines in the data set had a quality score of 6; if the scores had been more evenly spread, we might have gained a better understanding of how the variables affect quality. No variable other than alcohol showed a strong relationship with quality, so more data would be needed to fit a better linear model. To investigate the data further, I would combine it with the red wine data set to understand the differences between white and red wine. I would also like to see whether the factors that affect quality in white wines play a similar role in red wines.

References

  1. https://tools.thermofisher.com/content/sfs/brochures/XX-72102-Wine-Analysis-XX72102-EN.pdf

  2. P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

  3. https://stackoverflow.com/questions/4787332/how-to-remove-outliers-from-a-dataset

  4. https://www.theanalysisfactor.com/outliers-to-drop-or-not-to-drop/

  5. http://www.cookbook-r.com/